Abstract

We present a formal specification of primary-backup. We then prove 
lower bounds on the degree of replication, failover time, and worst- 
case response time to client requests assuming different failure models. 
Finally, we outline primary-backup protocols and indicate which of 
our lower bounds are tight. 

Keywords: Fault-tolerance, reliability, availability, primary-backup, lower 
bounds, optimal protocols. 

1 Introduction 

One way to implement a fault-tolerant service is by using multiple servers 
that fail independently. The state of the service is replicated and distributed 
among these servers, and updates are coordinated so that even when a subset
of servers fail, the service will remain available. 

Such fault-tolerant services have been structured in several ways. One 
approach is to replicate the service state across all servers and to present 
each client's request to all nonfaulty servers in the same order. This service
architecture is commonly called active replication or the state machine
approach [Sch90] and has been widely studied from both theoretical and
practical viewpoints (e.g., [PSL80, CASD85, JB89]).

Another approach to building replicated services is to designate one 
server as the primary and all the others as backups. Clients make requests 
by sending messages only to the primary. If the primary fails, then a failover 
occurs and one of the backups takes over. This service architecture is com- 
monly called the primary-backup or the primary-copy approach [AD76] and 
has been widely used in commercial fault-tolerant systems. However, the
approach has not been analyzed as extensively as the state machine ap- 
proach, and little is known of the costs and tradeoffs, the degree of repli- 
cation required, and the worst-case response time for various failure mod- 
els. In this paper, we derive some of these tradeoffs. For example, some 
primary-backup protocols use more servers than the number of failures to 
be tolerated [LGG+91]. We are able to show that the number of servers
needed depends on the failure model. 

The key difference between the active replication and primary-backup 
approaches is how each masks failures. With active replication, server fail- 
ures are completely masked by voting and the service implemented is that of 
a single non-faulty server. With the primary-backup approach, a request to 
the service can be lost if it is sent to a faulty primary.¹ Thus, clients can now
observe the effects of server failures. Periods during which requests are lost,
however, are bounded by the length of time that can elapse between failure
of the primary and takeover by a backup. Such behavior is an instance of
what we call a bofo service (bounded outage finitely often). Specifically, a
service outage occurs at time t if some client makes a request at that time
but never receives a response to that request.² Furthermore, in a (k, Δ)-bofo
service, all service outages can be grouped into at most k intervals of
time, with each interval having a length of at most Δ. Accordingly, even
though some requests may not elicit a response from a bofo service, not
too many will. Note that if clients are restricted to send requests only to
a single server, then one cannot implement a service that is stronger than
bofo. This is because if the client sends a request to a server and the server
subsequently crashes, then the request can be lost and will not be processed.

¹ The client can subsequently resend a copy of that request to the new primary.

² For simplicity, we assume in this paper that every request elicits a response.
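
The grouping condition in the definition of a bofo service can be checked mechanically on a finite trace. The following is a minimal sketch, assuming the outages are given as a list of times at which a request was sent and never answered; the function names and the parameter `delta` (the bound on the length of one outage interval) are illustrative and do not come from the paper.

```python
def min_outage_intervals(outage_times, delta):
    """Fewest closed intervals of length <= delta that cover every outage time."""
    count = 0
    cover_end = None  # right end of the interval currently being filled
    for t in sorted(outage_times):
        if cover_end is None or t > cover_end:
            # Greedy choice: open a new interval at the first uncovered outage.
            count += 1
            cover_end = t + delta
    return count


def is_bofo(outage_times, k, delta):
    """True if the outages fit into at most k intervals of length at most delta."""
    return min_outage_intervals(outage_times, delta) <= k
```

Starting each interval at the earliest uncovered outage is enough here because intervals on a line can always be shifted right to begin at a point they must cover.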

In this paper, we give lower bounds for implementing a bofo service 
using the primary-backup approach. These lower bounds depend on the 
message delivery delay and the kinds of failures that can be tolerated. The 
lower bounds constrain the degree of replication, the time during which the 
service can be without a primary, and the worst-case response time of client 
requests. In some cases the results are surprising. For example, more than 
f + 1 servers are necessary to tolerate f failures of certain types (crash and
link failures, receive-omission failures, or general-omission failures). Also, if 
a majority of the servers can be faulty, then any primary-backup protocol 
for receive-omission failures will have a run in which the primary is non- 
faulty, but it is forced to become a backup, while a server that is faulty 
becomes the primary in its place. 

Finally, in this paper we outline some primary-backup protocols. This 
allows us to determine which of our lower bounds are tight. 

The paper is organized as follows. Section 2 gives a formal specification 
of a primary-backup protocol. Section 3 defines our system model. Sec- 
tion 4 discusses the lower bounds, and in Section 5 we outline our protocols 
and state which of the previously-shown bounds are tight. We conclude in 
Section 6. 

2 Primary-Backup Protocols

To derive lower bounds, we have to give a precise definition of a primary- 
backup protocol. We believe that the following four properties characterize 
a primary-backup protocol and note that many primary-backup protocols
(e.g., [AD76, Bar81, Cen87, BEM91]) satisfy this characterization.

Pb1: There exists a predicate Prmy_s on the state of each server s. At any
time, there is at most one server s whose state satisfies Prmy_s.³

For brevity, whenever we say that "s is the primary (at time t)" we mean
that the state of s satisfies Prmy_s. Note that the failover time for a service
is the longest period of time during which Prmy_s is not true for any s.

³ The protocol of [LGG+91] allows concurrent primaries, but only for bounded periods.
If one replaces Pb1 by this property, then except for the bounds on failover times, the
bounds shown in Section 4 continue to hold.
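
As an illustration of the first property and of the failover time, the sketch below checks a recorded timeline of primaries. It assumes a hypothetical trace format, a list of `(server, start, end)` tuples giving half-open intervals during which that server's state satisfied the primary predicate; neither the format nor the function names appear in the paper.

```python
def satisfies_pb1(primary_intervals):
    """True if no two primary intervals [start, end) overlap in time."""
    ordered = sorted(primary_intervals, key=lambda iv: iv[1])
    # After sorting by start time, any overlap shows up between neighbours.
    for (_, _, prev_end), (_, next_start, _) in zip(ordered, ordered[1:]):
        if next_start < prev_end:
            return False
    return True


def observed_failover_time(primary_intervals):
    """Longest gap with no primary between the first and last recorded interval."""
    ordered = sorted(primary_intervals, key=lambda iv: iv[1])
    if not ordered:
        return 0
    longest = 0
    latest_end = ordered[0][2]
    for _, start, end in ordered[1:]:
        longest = max(longest, start - latest_end)
        latest_end = max(latest_end, end)
    return longest
```

The second function only measures gaps inside one observed run, so it gives a lower estimate of the failover time of a protocol, not the failover time itself.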
Pb2: Each client i maintains a server identity Dest_i such that to make a
request, client i sends a message only to Dest_i.

Property Pb2 distinguishes the primary-backup approach from active
replication, where each client sends requests to every server in the service.

For the next property, we model a communications network by assuming 
that client requests are enqueued in a message queue of a server. 

Pb3: If a request arrives at a server that is not the primary, then the request 
is not enqueued (and is therefore not processed). 

Properties Pb1-Pb3 specify a protocol for interacting with a service, but
not the semantics of the service. For example, the properties do not rule out
a primary that ignores all requests. A fourth property eliminates such trivial
implementations by stipulating that the server be bofo for some values of k
and Δ:

Pb4: There exist fixed values k and Δ such that the service behaves like a
single (k, Δ)-bofo server.

This property is not implementable if the number of failures is not a priori 
bounded. Assuming a bounded number of failures is just a modeling trick. 
When the number of failures is unbounded, bounding the rate of failures 
and including reintegration of recovered servers can provide service outages 
of bounded lengths. We do not address failure rates or reintegration in this 
paper. 

A Simple Primary-Backup Protocol 

As an example of a service based on the primary-backup approach, consider 
the following protocol, which tolerates a single server crash. Assume that all 
communication is over point-to-point nonfaulty links and that each link has 
an upper bound δ on message delivery time.⁴ Refer to Figure 1. There is
a primary server p_1 and a backup server p_2 connected by a communications
link. A client initially sends all requests to p_1 (indicated by the arrow labeled
1 in the figure). Whenever p_1 receives such a request, it

• processes the request and updates its state accordingly,

• sends information about the state update to p_2 (message 2 in the
figure),

• without waiting for an acknowledgement from p_2, sends a response to
the client (message 3 in the figure).

⁴ To simplify exposition, we assume that the maximum message delay between the
clients and the servers is the same as the delay between the servers. However, our results
can be easily extended to the case when the delays are different.

Figure 1: A Simple Primary-Backup Protocol.

The order in which these messages are sent is important because it guarantees
that if the client receives a response, then either p_2 has received message
2 or p_2 has crashed.

Server p_2 updates its state upon receiving update messages from p_1. In
addition, p_1 sends messages to p_2 every τ seconds. If p_2 does not receive
such a message for τ + δ seconds, then p_2 becomes the primary. Once p_2
has become the primary, it informs the clients (who update their copies of
Dest) and begins processing any subsequent requests sent by them.
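
The behaviour of the two servers described above can be sketched as follows. This is a simplified illustration, not the paper's implementation: message delivery is modelled as a direct method call with zero delay, time is passed in explicitly, and the class names, method names, and the parameters `tau` (interval between the primary's periodic messages) and `delta` (bound on message delivery time) are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Backup:
    tau: float            # interval between periodic messages from the primary
    delta: float          # upper bound on message delivery time
    state: dict = field(default_factory=dict)
    last_heard: float = 0.0
    is_primary: bool = False

    def on_message(self, now, update=None):
        """Record any message from the primary; apply a state update if present."""
        self.last_heard = now
        if update is not None:
            self.state.update(update)

    def check_timeout(self, now):
        """Take over once nothing has arrived for tau + delta."""
        if now - self.last_heard >= self.tau + self.delta:
            self.is_primary = True
        return self.is_primary


@dataclass
class Primary:
    backup: Backup
    state: dict = field(default_factory=dict)

    def handle_request(self, now, key, value):
        self.state[key] = value                      # process the request
        self.backup.on_message(now, {key: value})    # message 2: state update
        return ("response", key, value)              # message 3: reply to client
```

The update is handed to the backup before the response is returned, which mirrors the message ordering the protocol relies on.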

We now show that this protocol satisfies our characterization of a primary-backup
protocol. Property Pb1 requires that there never be two primaries.
This is satisfied by the following definitions of Prmy:

Prmy_{p_1} ≝ (p_1 has not crashed)

Prmy_{p_2} ≝ (p_2 has not received a message for τ + δ)

The predicate Prmy_{p_1} ∧ Prmy_{p_2} is always false in a system executing our
protocol, and hence Pb1 is satisfied. The failover time for this protocol is
the longest interval during which ¬Prmy_{p_1} ∧ ¬Prmy_{p_2} can hold, and it is
τ + 2δ seconds. Property Pb2 follows trivially from the description of the
protocol. Property Pb3 is true because requests are not sent to p_2 until
after p_1 has failed. Finally, Pb4 requires that the protocol implements a
single bofo server for some values of k and Δ. Since p_1 sends message 2
before message 3, it will never be the case that p_1 sends a response to the
client and p_2 does not get information about that response from p_1. Using
this fact, it can be shown that the service behaves like a single server. To
compute k and Δ, we can let k = 1 and so it suffices to compute the longest
interval during which a client request may not elicit a response. Assume
that p_1 crashes at time t_c. Any request sent at t_c − δ or later may be lost
since p_1 crashes at t_c. Furthermore, p_2 may not learn about p_1's crash until
t_c + τ + 2δ, and clients may not learn that p_2 is the primary for another
δ. So, the total period during which a request may not elicit a response is
t_c − δ through t_c + τ + 3δ: the service is equivalent to a single (1, τ + 4δ)-bofo
server.
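
The arithmetic behind these bounds can be restated as a small calculation. The sketch below only re-expresses the quantities derived above; the function names are illustrative, `t_c` is the crash time of the original primary, `tau` is the interval between its periodic messages, and `delta` is the message delay bound.

```python
def simple_protocol_failover_time(tau, delta):
    """Longest time with no primary: last message in flight plus the timeout."""
    return tau + 2 * delta


def outage_window(t_c, tau, delta):
    """Return (start, end, length) of the period in which requests may be lost."""
    start = t_c - delta               # requests still in flight when p_1 crashes
    end = t_c + tau + 3 * delta       # takeover, plus delta to inform the clients
    return start, end, end - start
```

For example, with a crash at time 10, periodic messages every 5 time units, and a delay bound of 1, the window runs from 9 to 18 and has length 9.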

3 The Model 

We consider a system consisting of n_s servers and n_c clients. We assume
that server clocks are perfectly synchronized with real time.⁵ Clients and
servers communicate by exchanging messages through a completely connected
point-to-point network. Each message sent is enqueued in a queue
maintained by the receiving process, and a process accesses its message
queue by executing receive. We assume that links between processes are
FIFO (i.e., if p_i sends message m followed by m' to process p_j, then p_j will
never receive m after m') and if processes p_i and p_j are connected by a (nonfaulty)
link, then a message sent from p_i to p_j at time t will be enqueued in
p_j's queue at or before t + δ.

We are interested in identifying the costs inherent in primary-backup 
protocols, and so we assume that it takes no time for a server to compute a 
response. We also assume that a client can send a request at any time. 

We model execution of a system by a run, which is a sequence of timestamped
events involving clients, servers, and the message queues. These
events include sending messages, enqueuing messages, receiving messages,
and modeling computation at processes. Two runs σ_1 and σ_2 of the system
are indistinguishable to a process p if the same sequence of events (with the
same timestamps) occurs at p in both σ_1 and σ_2. We assume that if two runs
σ_1 and σ_2 are indistinguishable to p, then at any time t, the state of p at
time t in σ_1 is the same as the state of p at time t in σ_2. Again, it is not hard
to extend our definition of indistinguishability to handle nondeterministic
servers; the current definition does not.

⁵ Extension to the case where clocks are only approximately synchronized [LMS85] is
discussed in [Bud93].

We consider the following hierarchy of failure models: 

Crash failures: A server may fail by halting prematurely. Until it halts, it 
behaves correctly. After it halts, a timeout can detect this fact.⁶

Crash+Link failures: A server may crash or a link may lose messages (but
not delay, duplicate or corrupt messages). 

Receive-Omission failures: A server may fail not only by crashing, but also
by omitting to receive some of the messages sent to it over a nonfaulty 
link. 

Send-Omission failures: A server may fail not only by crashing, but also by 
omitting to send some of the messages over a nonfaulty link. 

General-Omission failures: A server may exhibit send-omission and receive- 
omission failures. 

Figure 2 illustrates this failure hierarchy. Note that crash+link failures 
and the various types of omission failures are distinct. Although both rep- 
resent loss of messages, each is dealt with by a different masking technique. 
In particular, crash+link failures can be masked by adding redundant com- 
munication paths, while omission failures can only be masked by adding 
sufficient redundant servers so that faulty processes can detect their own 
failure and halt. We discuss these masking techniques in Section 5. 
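
The inclusion relationships between these failure models can be written down as a small partial order. The sketch below is an illustration only: the edges are read off the textual definitions above (each model that says "not only by crashing" includes crash failures, and general-omission includes both kinds of omission), and the dictionary and function names are hypothetical.

```python
# Maps each failure model to the models that directly include it.
INCLUDED_IN = {
    "crash": {"crash+link", "receive-omission", "send-omission"},
    "receive-omission": {"general-omission"},
    "send-omission": {"general-omission"},
    "crash+link": set(),
    "general-omission": set(),
}


def subsumes(stronger, weaker):
    """True if every failure allowed by `weaker` is also allowed by `stronger`."""
    seen, frontier = set(), [weaker]
    while frontier:
        model = frontier.pop()
        if model == stronger:
            return True
        if model not in seen:
            seen.add(model)
            frontier.extend(INCLUDED_IN[model])
    return False
```

Under this encoding crash+link and the omission models are incomparable, which matches the remark that they are distinct.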

Henceforth, we assume that no more than f_s servers can be faulty, and
for crash+link failures that no more than f_l links can be faulty.

4 Lower Bounds 

We now give lower bounds for implementing a single (k, Δ)-bofo server using
the primary-backup approach for each failure model. 

⁶ The lower bounds we derive for crash failures also hold for fail-stop failures [SS83]
except for the bound on failover time. The lower bound on failover time depends on the
maximum duration between when a server p_i fails and when failed_i becomes true.
Figure 2: Failure Hierarchy

4.1 Bounds on Replication

The first theorem is obvious. However, to introduce our notation and the 
proof technique that will be used later in the section, we give a formal proof 
of the theorem. 

Theorem 1 Any primary-backup protocol tolerating f_s crash failures requires
n_s ≥ f_s + 1.

Proof: We prove the result by contradiction. Suppose there is a protocol
P for n_s < f_s + 1. Thus, P satisfies Pb4. Consider a run in which all n_s
servers are crashed initially and clients submit R > k⌈Δ/d⌉ requests, where
d is the minimum time between the sending of any two requests (d > 0). By
Pb4, at least one of these requests must elicit a response. This is because
the requests that cannot have responses must fall into at most k
intervals of length at most Δ, and each interval of length Δ can contain at most
⌈Δ/d⌉ requests. However, such a response is impossible since, by assumption,
all servers have crashed. □
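
The counting step in this proof can be illustrated numerically. The sketch below simply evaluates the threshold used in the argument; the function names are illustrative, and `outage_len` stands for the bound on the length of one outage interval.

```python
import math


def max_unanswered(k, outage_len, d):
    """Upper bound used in the proof on requests that may go unanswered."""
    return k * math.ceil(outage_len / d)


def forces_response(R, k, outage_len, d):
    """True if sending R requests guarantees that at least one is answered."""
    return R > max_unanswered(k, outage_len, d)
```

For instance, with k = 2, an outage bound of 10, and requests at least 3 apart, the proof allows at most 8 unanswered requests, so a ninth request must elicit a response.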

The following lemma will be used for the rest of the theorems in this 
section. 

Lemma 4.1 Consider any protocol that satisfies Pb4. Suppose two disjoint
and nonempty sets of servers A and B can be found that meet the following
three properties:

1. There exists a run σ_a containing R > 2k⌈Δ/d⌉ requests where d is the
minimum time between the sending of any two client requests (d > 0).
Furthermore, in this run the servers in A do not crash and all other
servers crash at time 0.

2. There exists a run σ_b containing R requests. Furthermore, in this run
the servers in B do not crash and all other servers crash at time 0.

3. There exists a run σ_ab containing R requests. Furthermore, the servers
in A and B do not crash, σ_ab is indistinguishable from σ_a to all servers
in A, and is indistinguishable from σ_b to all servers in B.

At least one of the above runs violates Pb2.

Proof: Suppose for contradiction that the lemma is false and runs σ_a, σ_b
and σ_ab all satisfy Pb2.

For σ_a, by Pb4 at least R − k⌈Δ/d⌉ of the requests must have been received
by servers in A. Similarly, for σ_b, at least R − k⌈Δ/d⌉ of the requests
must have been received by servers in B. Finally, since σ_ab is indistinguishable
from σ_a to servers in A, they must execute the same number of receive
events in both runs. The same holds for the servers in B. By Pb2, each
request is sent to at most one server and so at least 2(R − k⌈Δ/d⌉) requests
must have been sent in σ_ab. Since only R requests were sent, we must have
R ≥ 2(R − k⌈Δ/d⌉), or R ≤ 2k⌈Δ/d⌉, which contradicts the assumption
that R > 2k⌈Δ/d⌉.

□
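
The double-counting step of this proof can be checked on concrete numbers. The following sketch is illustrative only (the function names and the parameter `outage_len` are not from the paper): it computes how many requests the servers in A and in B must receive between them in the combined run and compares that with the number actually sent.

```python
import math


def required_receives(R, k, outage_len, d):
    """Requests that A and B together must receive in the combined run."""
    lost = k * math.ceil(outage_len / d)   # requests Pb4 allows to go unanswered
    return 2 * (R - lost)                  # A's share plus B's share


def exceeds_requests_sent(R, k, outage_len, d):
    """True if more receives are required than the R requests that were sent."""
    return required_receives(R, k, outage_len, d) > R
```

With k = 2, an outage bound of 10, and d = 3, the lemma assumes R > 16; for R = 17 the two sets would need 18 receives from 17 single-destination requests, which is the contradiction the proof derives.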

Theorems 2 and 3 depend on two parameters of primary-backup protocols.
Let T be the maximum time between any two successive client requests
(possibly from different clients) in any run of the system, and let D be a
duration such that if some server s becomes the primary at time t_0 and remains
the primary through time t ≥ t_0 + D when a client c_i sends a request,
then Dest_i = s at time t. For simplicity of notation, we will write D < T to
mean that D is bounded and T is either unbounded or bounded and greater
than D.

With both send-omission failures and crash+link failures, messages may 
fail to reach their intended destinations. The following theorem shows that 
crash+link failures are more expensive to tolerate as they require more repli- 
cation. 
Theorem 2 Suppose there is at most one link between any two servers
and the total number of server and link failures that can occur is f, where
f < min(f_s, f_l). Then any primary-backup protocol tolerating crash+link
failures and having D < T requires n_s ≥ f + 2.

Proof: For contradiction, assume the existence of a protocol P with
n_s < f + 2. We will show that P has three runs σ_a, σ_b and σ_ab that satisfy
the conditions of Lemma 4.1. From the lemma, at least one of these runs
violates Pb2, which implies that P cannot be a primary-backup protocol.

Let A be a set containing the one server s_a and let B be the set of
remaining servers. Since |A| = 1 and |B| = n_s − 1 ≤ f, A and B can be
partitioned by link failures.

We first construct the run σ_ab in which no server crashes, postulating
that the links between the servers in A and B are faulty and do not deliver
any messages. As required by Lemma 4.1, clients will send a total of R >
2k⌈Δ/d⌉ requests. Let 0 < d < T − D be the minimum interval between any
two such requests. We postulate that a request will be sent at time t iff no
request has been sent during the interval [t − d..t) and one of the following
rules holds.

1. A server s is the primary during the interval [t − D..t]. This request
arrives immediately and is enqueued (at s, by Pb3 and the definition
of D).

2. There is no primary at time t. This request arrives immediately and
by Pb3 will never be enqueued at any server.

3. A server s is the primary at time t but another server s' is the primary
immediately after time t. If this request is sent to s, then it arrives
after t, and if it is sent to any other server, then it arrives immediately.
In both cases, it arrives at a server that is not the primary, and so will
not be enqueued (again by Pb3).

Note that, by construction, the maximum interval between any two client
requests is D + d. This interval occurs when a server s becomes the primary
just before d has elapsed after a client request is sent, and s remains the
primary for at least D. Hence, the client will be able to send R requests
within time R(D + d). This completes the construction of σ_ab.

We now construct σ_a and σ_b, recalling that in σ_a all of the servers except
s_a crash at time 0, and in σ_b server s_a crashes at time 0. The clients send the
same requests and at the same times in σ_a and in σ_b as in σ_ab. Furthermore,
by construction these requests will arrive at the servers according to the
same rules used in constructing σ_ab. Of course, a client request may not be
delivered to the same servers in σ_a or σ_b as in σ_ab, since different servers are
operational in these runs.

Since s_a does not receive any messages from servers in B in either σ_ab
or σ_a, these two runs are indistinguishable to s_a as long as it receives the
same client requests at the same times in both runs. We show that this is the case
by contradiction: let t be the earliest time that s_a can distinguish between
these two runs.

Thus, at time t either s_a received a request m in σ_ab but not in σ_a or it
received a request m in σ_a but not in σ_ab. We will assume the former; the
proof for the latter is similar. The request m must have been enqueued at
some time t' < t at s_a in σ_ab. Since m was received by s_a, m must have been
sent by rule 1. By rule 1, s_a must have been the primary through [t' − D..t']
in σ_ab and therefore, by indistinguishability, in σ_a as well. By the definition
of D, m would have been enqueued at s_a at time t' in σ_a as well.

Since s_a cannot distinguish between the runs before t, s_a cannot receive
m before t in σ_a, and s_a must execute a receive in both σ_a and σ_ab at time
t. So, it must be the case that s_a receives another request m' ≠ m at time t
in σ_a. Assume that m' was enqueued at time t''. By an indistinguishability
argument similar to above, m' must be enqueued at time t'' at s_a in σ_ab as
well. Therefore, if s_a received m' in σ_a at time t, it must receive m' in σ_ab as
well, a contradiction.

A similar argument can be used to show that the servers in B receive
the same requests in σ_b and σ_ab, and so these two runs are indistinguishable
to the servers in B. Thus, by Lemma 4.1 P cannot be a primary-backup
protocol. □

The next theorem states that additional replication is required in order to 
tolerate receive-omission failures. The proof is similar to that of Theorem 2, 
and so it is omitted. 

Theorem 3 Any primary-backup protocol tolerating receive-omission fail- 
ures and having D <T requires n, > . 

The next lower bound holds independent of the relation between D and 
T. However, before we prove the result, we need the following definitions. 

Define ≺ to be the potential causality relation [Lam78] on server events
e_1 and e_2 as follows: e_1 ≺ e_2 iff

1. Both e_1 and e_2 occur at the same server s and e_1 occurs before e_2, or

2. e_1 is a send event and e_2 is the corresponding receive event, or

3. (∃e: e_1 ≺ e ∧ e ≺ e_2)

We say that a request m is an update request iff in any run σ for which m
has a response r, any other response r' sent after r in real time causally 
follows m, i.e. if event e(m) corresponds to the receipt of m and event 
e(r') corresponds to the sending of r', then e(m) -< e(r'). A primary-backup 
protocol is trivial to implement if there are no update requests, and so we 
assume that update requests exist and that clients can send them at any 
time. 
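
The potential causality relation used in this definition can be computed for a finite run as a transitive closure. The sketch below is an illustration under assumed inputs: each server's events are given as a list in local order, each message as a (send event, receive event) pair, and event identifiers are unique hashable values. The function name and input format are hypothetical.

```python
def potential_causality(server_events, messages):
    """Return the set of pairs (e1, e2) such that e1 causally precedes e2.

    server_events: dict mapping a server name to its events in local order.
    messages: iterable of (send_event, receive_event) pairs.
    """
    successors = {}
    for events in server_events.values():
        for e in events:
            successors.setdefault(e, set())
        for earlier, later in zip(events, events[1:]):
            successors[earlier].add(later)            # same-server order
    for send, recv in messages:
        successors.setdefault(send, set()).add(recv)  # send before its receive
        successors.setdefault(recv, set())

    relation = set()
    for start in successors:                          # transitive closure
        stack, seen = list(successors[start]), set()
        while stack:
            e = stack.pop()
            if e not in seen:
                seen.add(e)
                stack.extend(successors[e])
        relation.update((start, e) for e in seen)
    return relation
```

With such a relation in hand, checking whether a later response causally follows the receipt of a request m is a single membership test.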

Theorem 4 Any primary-backup protocol tolerating general-omission failures
requires n_s > 2f_s.

Proof: Assume for contradiction that there is a protocol for n_s ≤ 2f_s.
Partition the servers into two disjoint sets A and B of size at most f_s each.
We will construct two runs σ_1 and σ_2. In each run, one set of servers will
be faulty and the other set will be nonfaulty.

σ_1: The servers in A are faulty and fail to communicate with all servers
in B, but behave correctly otherwise. Clients send update requests until
the first response is sent (this must happen, by Pb4). Assume that the first
response r to a request m is sent at time t. Say that this response is sent by
server s.

σ_2: The same as σ_1 up to time t, but if s is in B, then in σ_2 the servers
in B are faulty and fail to communicate with all servers in A. In either case,
no server can distinguish σ_1 from σ_2 through time t and therefore, the first
response r is sent at time t in σ_2 as well.

By construction, r is sent by a faulty server in σ_2. Let all of the faulty
servers in σ_2 crash immediately after r is sent and have clients continue to
send requests until another response r' is sent. This response must have
been sent by a nonfaulty server, which implies that ¬(e(m) ≺ e(r')). However,
this violates the fact that m is an update request. □


4.2 Bounds on Blocking 

Informally, a blocking primary-backup protocol is one in which the primary
must, subsequent to receiving a request m, either receive a message from
another server or simply wait an interval before it can respond to m. We
say that a primary-backup protocol is C-blocking if, whenever a request (received,
say, at t_m) elicits a response in a failure-free run at time t_r, then t_r − t_m ≤
C. For example, any primary-backup protocol in which the primary sends
information about a request to the backups and waits for acknowledgement
before sending the response to the client will be at least 2δ-blocking.
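
On a recorded failure-free run, the blocking behaviour can be measured directly. The sketch below is illustrative: it assumes a trace given as (t_m, t_r) pairs for answered requests, and it takes the bound as non-strict so that a primary replying the instant a request arrives counts as 0-blocking. The function names are hypothetical, and a single trace can only refute a claimed bound C, not establish it for every run.

```python
def observed_blocking(failure_free_run):
    """Largest delay t_r - t_m between receiving a request and responding."""
    return max((t_r - t_m for t_m, t_r in failure_free_run), default=0)


def consistent_with_c_blocking(failure_free_run, C):
    """True if no request in this trace waited longer than C for its response."""
    return observed_blocking(failure_free_run) <= C
```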

As shown in Section 5, 0-blocking primary-backup protocols are possible 
for crash and crash+link failure models. The simple protocol tolerating 
crash failures presented in Section 2 is 0-blocking. We call such protocols 
nonblocking because the primary can send a reply to the client as soon as 
the reply is computed. Nonblocking protocols tolerating receive-omission 
failures are also possible as long as n_s > 2f_s, but there is no nonblocking
primary-backup protocol tolerating send-omission failures. 

Theorem 5 Any primary-backup protocol tolerating receive-omission failures
with f_s > 1, n_s ≤ 2f_s and D < T is C-blocking for some C ≥ 2δ.

Proof: For contradiction, suppose there is a primary-backup protocol
for n_s ≤ 2f_s and f_s > 1 that is C-blocking where C < 2δ. Partition the
servers into two sets A and B where |A| = f_s and |B| = n_s − f_s ≤ f_s. We
construct three runs. In all three runs, assume that all server messages take
δ to arrive.

σ_1: There are no failures and all client requests take δ to arrive. Moreover,
clients send update requests until some request m evokes a response
r. Let m be received at time t_m by server p ∈ A and r be sent at time t_r
by a different server q ∈ A. Notice that since the protocol is C-blocking
where C < 2δ, t_r − t_m < 2δ. Also, since by construction all requests take
δ to arrive, all client requests sent after time t_m + δ will be received after
time t_r.

σ_2: Identical to σ_1 until p receives m at time t_m. At this point in σ_2, all
servers in A are assumed to crash and clients are assumed to send no request
during the interval [t_m + δ..t_r]. Finally, after time t_r clients are assumed to
repeatedly send requests at intervals of at least d, where 0 < d < T − D, as
follows. A request is sent at time t if no request has been sent in [t − d..t)
and one of the following rules holds.

1. A server s ∈ B is the primary during the interval [t − D..t]. This
request arrives immediately and is enqueued (at s, by Pb3 and the
definition of D).

2. There is no primary in B at time t. This request arrives immediately
and by Pb3 will never be enqueued at any server.

3. A server s ∈ B is the primary at time t but another server s' ∈ B
is the primary immediately after time t. If the request is sent to s,
then it arrives after t, and if it is sent to any other server it arrives
immediately. In both cases, it arrives at a server that is not the primary,
and so will not be enqueued (again, by Pb3).

Notice that eventually, there will be a response (say r') in σ_2 because
the protocol satisfies Pb4, and by construction it must be from a request
sent by rule 1.

σ_3: The same as σ_2, except that the servers in A do not crash at time
t_m. Instead, the servers in B commit receive failures on all messages sent
after t_m by servers in A. Clients send requests at the same times as in σ_2
which arrive using the same rules as σ_2.

Now, consider these three runs. By construction, the runs are identical
up to time t_m. Since all server messages take δ to arrive, clients cannot
distinguish σ_1 and σ_3 through t_m + δ, and so clients send the same requests to
the same servers in both σ_1 and σ_3. Similarly, since all server messages take
δ to arrive, the servers in B cannot distinguish between σ_1 and σ_3 through
t_m + δ. Therefore, since t_r − t_m < 2δ, p (the server that received request
m at time t_m in σ_1) and q (the server that sent response r at time t_r in
σ_1) cannot distinguish between σ_1 and σ_3 through time t_r, and so q sends
response r in σ_3 as well. Then, using an argument similar to the one in
Theorem 2, servers in B cannot distinguish σ_2 and σ_3, and so response r'
also occurs in σ_3. However, ¬(e(m) ≺ e(r')), which violates the assumption
that m is an update request. □

In run σ_3 of the above proof, a correct primary (p in set A) becomes
the backup, while a faulty server from set B becomes the primary in p's
place. It is always possible to construct such a run. This is a disconcerting
property: there does not exist a primary-backup protocol that tolerates
receive-omission failures with n_s ≤ 2f_s in which a primary cedes only when
it fails. Moreover, this lower bound is tight: we have constructed a receive-omission
primary-backup protocol with n_s = 2f_s + 1 in which a primary
cedes only when it fails.

The above lower bound holds only if f_s > 1. If f_s = 1, then the following
theorem holds. Its proof is similar to the proof of Theorem 5, except that
p = q.
Theorem 6 Any primary-backup protocol tolerating receive-omission failures
with f_s = 1 and n_s ≤ 2f_s and having D < T is C-blocking for some
C ≥ δ.

Primary-backup protocols tolerating send-omission failures exhibit the 
same blocking as those tolerating receive-omission failures: 

Theorem 7 Any primary-backup protocol tolerating send-omission failures
with f_s > 1 is C-blocking for some C ≥ 2δ.

Proof: For contradiction, suppose there is a primary-backup protocol
that is C-blocking where C < 2δ. We consider the following two runs in
which all server messages take δ to arrive.

σ_1: There are no failures and all client requests take δ to arrive. Moreover,
clients send update requests until some request m evokes a response r.
Let m be received at time t_m by server p and r be sent at time t_r by a different
server q. Notice that since the protocol is C-blocking where C < 2δ,
t_r − t_m < 2δ. Also, since by construction all requests take δ to arrive, all
client requests sent after time t_m + δ will be received after time t_r.

σ_2: Identical to σ_1 through t_m. After t_m, p and q fail and omit to send
all messages to all servers except each other. Since by construction all messages
take δ to arrive, servers and clients cannot distinguish between σ_1 and
σ_2 through t_m + δ, and as a result p and q cannot distinguish the two runs
through t_m + 2δ. Therefore, since t_r − t_m < 2δ, q sends the response r at
time t_r in σ_2 as well. Now let p and q crash at time t_r and the clients send
requests after time t_r. By Pb4, there eventually must be some request m'
that results in a response r'. However, ¬(e(m) ≺ e(r')), which violates the
assumption that m is an update request. □


Theorem 8 Any primary-backup protocol tolerating send-omission failures
with $f_s = 1$ is $C$-blocking for some $C \ge \delta$.

4.3 Bounds on Failover Times 

The failover time is the longest interval during which $\mathit{Prmy}_s$ is not
true for any server $s$. In this section, we present lower bounds on failover
times.
In order to discuss these bounds, we postulate a fifth property of primary- 
backup protocols: 

Pb5: A server that is the primary remains so until there is a failure.

This is a reasonable expectation and it is valid for all protocols that we have
found in the literature.

Theorem 9 Any primary-backup protocol tolerating $f_s$ crash failures must
have a failover time of at least $f_s\delta$.

Proof: Assume that the theorem is false. We derive a contradiction by
induction on $f_s$.

Base case $f_s = 0$: trivially true since the failover time cannot be
smaller than zero.

Induction case $f_s > 0$: suppose the theorem holds for at most $f_s - 1$
failures, but there is a protocol $P$ for which the theorem is false when there
are $f_s$ failures. From the induction hypothesis, there is a run $\sigma$ with at
most $f_s - 1$ failures and an interval $[t_0..t_1]$ of length at least
$(f_s - 1)\delta$ during which there is no primary. Let $p_1$ be the server that
becomes the primary at $t_1$. Consider the two runs $\sigma_1$ and $\sigma_2$ that
extend $\sigma$ as follows:

$\sigma_1$: Assume $p_1$ crashes at time $t_1$. By assumption, there exists a new
primary (say $p_2$) at time $t_2 < t_1 + \delta$. Since $p_1$ crashes at time
$t_1$, $p_2$ does not receive any messages from $p_1$ that were sent after
time $t_1$.

$\sigma_2$: Assume $p_1$ does not crash but all messages sent by $p_1$ after time
$t_1$ take $\delta$ to arrive.

Since $p_2$ cannot distinguish $\sigma_1$ from $\sigma_2$ through time $t_2$, $p_2$
becomes the primary at time $t_2$ in $\sigma_2$. By Pb5, however, $p_1$ remains the
primary at time $t_2$ in $\sigma_2$. This violates Pb1, and so $P$ is not a
primary-backup protocol. □
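
The timing arithmetic behind the induction step can be spelled out as
follows. This is only a sketch of the reasoning the proof relies on, using
the same symbols ($\sigma$, $\sigma_1$, $t_0$, $t_1$, $t_2$, $f_s$, $\delta$) as
above; it assumes that in $\sigma_1$ no server is the primary from $t_0$ until
$t_2$.

```latex
\begin{align*}
t_1 - t_0 &\ge (f_s - 1)\,\delta
  && \text{(induction hypothesis applied to } \sigma\text{)} \\
t_2 - t_0 &< f_s\,\delta
  && \text{(assumed failover time of } P \text{ in } \sigma_1\text{)} \\
t_2 - t_1 &= (t_2 - t_0) - (t_1 - t_0)
  < f_s\,\delta - (f_s - 1)\,\delta = \delta
\end{align*}
```

Hence $t_2 < t_1 + \delta$, so any message that $p_1$ sends after $t_1$ with
delay $\delta$ arrives after $t_2$, which is why $p_2$ cannot tell $\sigma_1$ from
$\sigma_2$ through time $t_2$.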

The failover times for all other failure models have a larger lower bound. 

Theorem 10 Any primary-backup protocol tolerating $f$ crash+link failures,
where $f \le \min(f_s, f_l)$, has a failover time of at least $2f\delta$.

Proof: We again assume that the theorem is false and derive a
contradiction by induction on $f$.

Base case $f = 0$: trivially true.

Induction case $f > 0$: suppose the theorem holds for at most $f - 1$
failures, but there is a protocol $P$ for which the theorem is false when there
are $f$ failures.

From the induction hypothesis, there is a run $\sigma$ with at most $f - 1$
failures and an interval $[t_0..t_1]$ of length at least $2(f - 1)\delta$ during
which there is no primary. Let $p_1$ be the server that becomes the primary at
$t_1$. Consider the three runs $\sigma_1$, $\sigma_2$, and $\sigma_3$ that extend
$\sigma$ as follows:

$\sigma_1$: Assume that $p_1$ crashes at time $t_1$ and that all messages sent
after $t_1$ take $\delta$ to arrive. By assumption, there exists a new primary
(say $p_2$) at time $t_2 < t_1 + 2\delta$. Since $p_1$ crashes at time $t_1$,
$p_2$ does not receive any messages from $p_1$ that were sent after time
$t_1$. Furthermore, since all messages take $\delta$ to arrive, any message that
was sent after $t_1 + \delta$ can be received by $p_2$ only after time $t_2$.

$\sigma_2$: Assume that $p_1$ does not crash and that all messages sent after time
$t_1$ take $\delta$ to arrive. Since there are no failures after time $t_1$, by
Pb5 $p_1$ continues to be the primary through time $t_2$.

$\sigma_3$: The same as $\sigma_2$ except that the link between $p_1$ and $p_2$ is
faulty and does not deliver any message sent by $p_1$ to $p_2$ after time
$t_1$.

By construction, $p_2$ cannot distinguish $\sigma_1$ from $\sigma_3$ through time
$t_2$, and so $p_2$ becomes the primary at time $t_2$ in $\sigma_3$. Similarly,
$p_1$ cannot distinguish $\sigma_2$ from $\sigma_3$ through time $t_2$, and so $p_1$
remains the primary until time $t_2$ in $\sigma_3$. This violates Pb1, and so $P$
is not a primary-backup protocol. □

We omit the proofs of the following two theorems because they are similar 
to Theorem 9. 

Theorem 11 Any primary-backup protocol tolerating $f_s$ receive-omission
failures has a failover time of at least $2f_s\delta$.

Theorem 12 Any primary-backup protocol tolerating $f_s$ send-omission
failures has a failover time of at least $2f_s\delta$.
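
As a quick way to compare the failover bounds of Theorems 9 through 12, the
following Python sketch tabulates them. The function name, the dictionary,
and the model labels are illustrative choices and not part of the paper; the
bounds themselves assume Pb5.

```python
# Multiplier k such that the failover lower bound is k * f * delta.
# Values are taken from Theorems 9-12; the names are hypothetical.
FAILOVER_MULTIPLIER = {
    "crash": 1,             # Theorem 9: f_s * delta
    "crash+link": 2,        # Theorem 10: 2 * f * delta, f <= min(f_s, f_l)
    "receive-omission": 2,  # Theorem 11: 2 * f_s * delta
    "send-omission": 2,     # Theorem 12: 2 * f_s * delta
}


def failover_lower_bound(model, f, delta):
    """Return the lower bound on failover time for f failures.

    model: one of the keys of FAILOVER_MULTIPLIER
    f:     number of failures tolerated (non-negative integer)
    delta: maximum message delivery time (non-negative number)
    """
    if f < 0 or delta < 0:
        raise ValueError("f and delta must be non-negative")
    return FAILOVER_MULTIPLIER[model] * f * delta
```

For instance, with three tolerated failures and a message delay of 10 time
units, `failover_lower_bound("crash", 3, 10)` evaluates to 30, while the same
call with `"send-omission"` evaluates to 60.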

5 Outline of the Protocols 

In order to establish that the bounds given above are tight, we have de- 
veloped a set of primary-backup protocols for the different failure mod- 
els [BMST92]. In this section, we outline these protocols and use them to 
show which of the lower bounds in the previous sections are tight.

Our protocol for crash failures is similar to the protocol given in
Section 2. Whenever the primary receives a request from the client, it
processes that request and sends information about state updates to the
backups before sending a response to the client. All servers periodically send
messages to each other in order to identify server failures. This protocol
uses $f_s + 1$ servers and is 0-blocking. Thus, Theorem 1 is tight and this
protocol uses the optimal number of servers and incurs no additional delay.
Furthermore, this protocol has the failover time $f_s\delta + \tau$ for
arbitrarily small and positive $\tau$, and so Theorem 9 is tight.
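
The request path described above can be sketched in Python as follows. This
is a minimal in-memory illustration, not the protocol of [BMST92]: the class
and method names are hypothetical, time and message delay are not modeled,
and the periodic failure-detection messages are replaced by a `crashed` flag
that the sketch reads directly.

```python
class Server:
    """A replica holding a copy of the service state (here, a dict)."""

    def __init__(self, name):
        self.name = name
        self.state = {}
        self.crashed = False

    def apply(self, key, value):
        self.state[key] = value


class PrimaryBackup:
    """Hypothetical crash-failure sketch with f_s + 1 servers."""

    def __init__(self, f_s):
        self.servers = [Server(f"s{i}") for i in range(f_s + 1)]

    def primary(self):
        # Stand-in for failure detection: the first non-crashed server.
        for s in self.servers:
            if not s.crashed:
                return s
        raise RuntimeError("more than f_s servers have crashed")

    def request(self, key, value):
        p = self.primary()
        p.apply(key, value)              # 1. primary processes the request
        for b in self.servers:
            if b is not p and not b.crashed:
                b.apply(key, value)      # 2. state update sent to backups
        return ("ok", p.name)            # 3. only then respond to the client
```

For example, after `pb = PrimaryBackup(2)` and `pb.request("x", 1)`, marking
`pb.servers[0].crashed = True` makes the next request be served by `s1`,
whose state already contains the earlier update.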

In order for the protocol to tolerate crash+link failures, we add an
additional server. By Theorem 2, this server is necessary. The additional
server ensures that there is always at least one nonfaulty path between any two
correct servers, where a path contains zero or more intermediate servers.
The crash failure protocol outlined above is now modified so that a primary
ensures any message sent to a backup is sent across at least one nonfaulty
path. Note that this protocol uses $f + 2$ servers and is 0-blocking. Thus,
Theorem 2 is tight and this protocol uses the optimal number of servers and
incurs no additional delay. Furthermore, this protocol has the failover time
$2f\delta + \tau$ for arbitrarily small and positive $\tau$, and so Theorem 10 is
tight.
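
To illustrate why an extra server can restore connectivity, the following
Python sketch checks whether a nonfaulty path exists between two correct
servers. It is a generic breadth-first search under the assumption that every
pair of servers shares one link; the function name and the representation of
links as `frozenset` pairs are illustrative and not taken from the paper.

```python
from collections import deque


def has_nonfaulty_path(servers, faulty_links, src, dst):
    """Return True if dst is reachable from src over nonfaulty links.

    servers:      iterable of correct (non-crashed) server names
    faulty_links: set of frozenset({u, v}) pairs whose link is faulty
    """
    servers = list(servers)
    seen = {src}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            return True
        for v in servers:
            # Follow the link u-v only if it is not marked faulty.
            if v not in seen and frozenset((u, v)) not in faulty_links:
                seen.add(v)
                queue.append(v)
    return False
```

With two servers `a` and `b` and a faulty link between them,
`has_nonfaulty_path(["a", "b"], {frozenset(("a", "b"))}, "a", "b")` is
`False`; adding a third correct server `c` makes the same query `True`,
because the path through `c` avoids the faulty link.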

Most of our protocols for the different kinds of omission failures apply
translation techniques [NT88] to the crash failure protocol. These techniques
ensure that a faulty server detects its own failure and halts. The translations
assume a round-based protocol. Since our crash failure protocol is not
round-based, we must modify the translations so that a server can send and
receive messages at any time rather than just at the beginning or the end of a
round. This is not difficult to do. All of these resulting omission protocols
have failover time $2f_s\delta + \tau$, and thus Theorems 11 and 12 are tight. The
protocol for send-omission failures uses $f_s + 1$ servers and is
$(2\delta + \tau)$-blocking. Furthermore, we also have a send-omission protocol
for $f_s = 1$ that is $\delta$-blocking. Thus, Theorems 7, 8 and 12 are tight. The
protocol for general-omission failures uses $2f_s + 1$ servers and is
$(2\delta + \tau)$-blocking, and so Theorem 4 is tight, and Theorems 7 and 12 are
tight for general-omission failures as well.

We have not been able to determine whether Theorems 3 and 5 are
tight. Our protocol tolerating receive-omission failures uses $2f_s + 1$
servers whereas the lower bound in Theorem 3 only requires
$n_s > \lfloor 3f_s/2 \rfloor$. We have constructed receive-omission protocols
for $n_s = 2$, $f_s = 1$ and $n_s = 4$, $f_s = 2$ but have not been able to
generalize the protocols. The protocols in this region have the odd property
that a nonfaulty primary can cede to a faulty primary, and so we do not expect
such protocols to have much practical importance. However, the protocol for
$n_s = 2$, $f_s = 1$ is $\delta$-blocking and so Theorem 6 is tight.

[Table 1: Lower Bounds. Footnote: $D < \Gamma$.]

Table 1 summarizes all of our results. 


6 Discussion 

This paper gives a formal characterization of primary-backup protocols for
synchronous systems. It presents lower bounds on the degree of replication,
the blocking time, and the failover time for a primary-backup protocol under
various kinds of server and link failures. A set of primary-backup protocols
is outlined and used to show which of our lower bounds are tight.

It is instructive to compare our results to existing primary-backup
protocols. A two-server primary-backup protocol that tolerates crash+link
failures is presented in [Bar81], which seemingly contradicts Theorem 2.
However, this protocol assumes that there are two links between the two
servers, which effectively masks a single link failure. Hence, only crash
failures need to be tolerated, which can be accomplished using only two servers
(Theorem 1).

A more ambitious primary-backup protocol is presented in [LGG+91].
This protocol tolerates the following failure model (quoted from [LGG+91]):

    The network may lose or duplicate messages, or deliver them late
    or out of order; in addition it may partition so that some nodes
    are temporarily unable to send messages to some other nodes. As
    is usual in distributed systems, we assume the nodes are fail-stop
    processors and the network delivers only uncorrupted messages.

This failure model is incomparable with the hierarchy we present. However,
the protocol does tolerate general-omission failures and has optimal degree
of replication as it uses $n_s = 2f_s + 1$ servers.

In Theorem 2, we assumed that $D < \Gamma$. This assumption is crucial:
we have constructed a two-server primary-backup protocol tolerating one
crash+link failure for which $D > \Gamma$. Recall that link failures are masked
by adding redundant paths between the servers. Our two-server crash+link
protocol essentially uses the path from the primary to the backup through
the client as the redundant path. Thus, there appears to be a tradeoff
between the degree of replication and the time it takes for a client to learn
that there is a new primary.

The lower bounds on failover times given in Section 4.3 were derived 
assuming Pb5. We have constructed primary-backup protocols that have 
failover times smaller than the lower bounds given in Section 4.3, and as 
expected these protocols do not satisfy Pb5. This smaller failover time is 
achieved at a cost of an increased variance in service response time. 

Finally, we have attempted to give a characterization of primary-backup
that is broad enough to include most synchronous protocols that are
considered to be instances of the approach. There are protocols, however, that
are incomparable to the class of protocols we analyze [BJ87]. In addition,
the protocols in [OL88, MHS89] are incomparable since they were developed
for an asynchronous setting. Such protocols cannot be cast in terms
of implementing a $(k, \Delta)$-bofo service for finite values of $k$ and
$\Delta$. We are currently studying possible characterizations for a
primary-backup protocol in an asynchronous system and expect to extend our
results to this setting.